Academic Open Internet Journal ISSN 1311-4360 |
Volume 19, 2006 |
Efficient and scalable partition based algorithms for
mining association rules
Muthukumar A1
Nadarajan R2
1
Corresponding author,
Assistant
Professor,
Department
of Computer Applications,
Kumaraguru
College of Technology, Coimbatore-641 006.
Phone:
91-9345166161
Fax: 91-422-2669406
amk_in_2000@yahoo.com
2 Dean
& HOD,
Department
of Mathematics & Computer Applications,
PSG
College of Technology, Coimbatore-641 004.
Phone:
91-422-5344157,5344777
Fax: 91-422-2573833
nadarajan_psg@yahoo.co.in
The
information revolution is generating mountains of data from sources like
astronomy observations, credit card transactions, genetics research, telephone
calls, and web clickstreams. The
progress in data mining and knowledge discovery since the first workshop on
knowledge discovery in databases has been tremendous. The research in data mining is developing fast and efficient
algorithms to derive knowledge from huge databases. There are several data
mining algorithms available to solve diverse data mining problems. They are mainly classified as associations,
classifications, sequential patterns, and clustering. One of the knowledge and discovery in databases (KDD) operations
is the problem of inducing association rules. One of the defining challenges
for the KDD research community is scaling up data mining algorithms to mine
very large collections of data. In this
paper, we have analyzed graph based association rule mining algorithms like
primitive association rule mining, generalized association rule mining and
multilevel association rule mining from the perspective of improving
scalability and performance. We have
achieved scalability and performance by dividing the transaction database into
partitions of equal size and brought one partition at a time to memory. Primitive association rule mining, which
is functionally similar to Apriori, outperforms Apriori. Generalized association rule mining provides
extra knowledge as sibling associations and even cross-parent
associations. Multilevel association rule
mining algorithm takes care of analyzing different level wise knowledge.
Keywords:
Data mining algorithms, Association Rule Mining,
Graph-based approach, Generalized Association Rule Mining, Multilevel
Association Rule Mining.
With the rapid growth in size and number of available
databases in commercial, industrial, administrative and other applications, it
is necessary and interesting to examine how to extract knowledge automatically
from huge amount of data [5]. Knowledge
discovery in databases, or Data Mining, is the effort to understand, analyze,
and eventually make use of huge volume of data available. Through the
extraction of knowledge in databases, large databases will serve as a rich,
reliable source for knowledge generation and verification and the discovered
knowledge can be applied to information management, query processing,
decision-making, process control and many other applications. Therefore, data mining has been considered
as one of the most important topics in databases by many database researchers.
There are
several data mining techniques available to solve diverse data mining
problems. They are mainly classified as
associations, classifications, sequential patterns, and clustering [3]. One of the KDD operations is the problem of
inducing association rules.
The
problem of mining association rules [3] in large databases is to generate all
association rules, of the form X ̃ Y, that have support and confidence greater than the
user-defined minimum support and minimum confidence. Formally, Let I= {i1,i2,….in} be a set of
items. Let D be our database, where
each transaction T is a set of items such that T < I. We say that a transaction T contains X, a
set of some items in I, if X Í T. An association rule is an implication of the
form X ̃ Y, where
X ̀ I, Y ̀ I,
and X Ç Y =F. The rule X ̃ Y holds in the database D with confidence c if at least c%
of transactions in D that contain X also contain Y. The rule X ̃ Y has
support s in the database if at least s% of transactions in D contains XY. Several algorithms for finding association
rules were developed in the last years.
These algorithms find all the rules that satisfy a given support /
confidence criterion. Most algorithms
have a common main structure, a two-step process: finding all sets of items
that have support above the given minimum, and generating the desired rules
from these itemsets.
The
Apriori algorithm is described as a “fast algorithm for mining association
rules” and is based on [1]. The
algorithm has the common structure of a two step process: First all large itemsets that have support
greater than the minimum support (minsup), are created incrementally. L1 , the large itemsets of size
1, is created in the first pass over the data, by simply counting the
appearance of each item in the data.
Subsequent L’s are created using their candidates. The candidates are potential large itemsets
of the current size. The candidates are
generated from the previous L using a fast and clever cross-product function. Then a pass over the data determines the
candidate’s support. All candidates
with support > minsup are placed in the next L. This process stops when Ln is empty. These large itemsets are then used to
produce rules. Each large itemset is
divided to all Head ̃ Body
combinations and each combination is tested for sufficient confidence. Here confidence is sup(Itemset)/ sup(Head).
If the confidence tested is greater than the user defined minsup then the rule
Head ̃ Body is
valid and is kept in a list.
The
disadvantage of this algorithm is number of data scans n where n is the size of
large nonempty itemset. Also the number of discovered rules is huge while most
of them are non-interesting. Therefore
several improved algorithms were proposed after Apriori for efficiency and
scalability.
One such
algorithm is graph-based approach [2] to discover various types of association
rules, namely, primitive association rules, generalized association rules and
multiple-level association rules. We
would discuss these algorithms under related work. Since this work uses large main memory, we propose a new method
that uses main memory efficiently and thus become scalable. The rest of this paper is organized as
follows. Related work is presented in section 2, discovering scalable graph
based approach in section 3 , performance of our data mining algorithm in
section 4 and data mining applications in healthcare in section 5. Finally we conclude the paper and present
directions for future research in section 6.
The graph
based method consists of three algorithms namely, finding a primitive
association rule, generalized association rule and multilevel association rules
[2]. Primitive association rule mining
is mining of association rules that describe the association among items of the
transaction in the database. A
primitive association pattern is a large itemset in which each item is a
database item. A uniform data mining
framework was proposed in [2] for primitive association rule to discover association
rules is given below.
q Numbering
phase: In this phase, all the items are assigned an integer number.
q Large item
generation phase: This phase generates large items and records related
information. A large item is an item
whose support is no less than a user specified minimum support.
q Association
graph construction phase: This phase constructs an association graph to
indicate the association between large items.
q Association
pattern generation phase: This phase generates all association patterns by traversing
the constructed association graph.
q Association
rule generation phase: The association rule can be generated directly according
to the corresponding association patterns.
In [2], another algorithm proposed was Generalized
Association Pattern Generation. It
discovers all generalized association patterns. To generate generalized association patterns, one can add all
ancestors for each items from concept hierarchy [4] and then apply the
algorithm on the extended transactions.
The
multiple-level association rules [6] are discovered from a large database of
customer transactions in which all items are described by a set of relevant
attributes. Each attribute represents a
certain concept and these relevant attributes form a set of multiple-level concepts.. For example, if the “category”, ”content”,
and “brand” of an item have the domain values “bread”, ”wheat”, and “Wonder”,
respectively, then this item is described as
“Wonder wheat bread” in the database.
Multilevel association graph is constructed by considering one level at
a time.
In all the
three algorithms in [2], the entire set of transactions is brought into main
memory and then the graph is created.
From graph, the association patterns and rules are generated. For large database this method is not
usable. Hence we propose a method in
which the set of transactions is divided into number of partitions and one
partition at a time is brought into main memory and the support counts of
associations are calculated. Only the
association pattern is kept in the main memory and not the entire set of
transactions and hence the method is usable and scalable.
In this new approach we propose a new framework,
which is scalable and efficient. The
entire database is divided into partitions of equal size. Each partition is considered one at a time
by loading the first partition into memory and calculating large itemsets and
the corresponding support counts. Then second partition is considered similarly
and the cumulative support count is calculated for the cumulative large
itemsets. This process is continued for the entire set of partitions and
finally we have the whole large itemsets and the corresponding cumulative
support counts. This approach reduces
main memory requirement since it considers only a small partition at a time and
hence it is scalable for any large size of the database. We have developed three different algorithms
for primitive association rule mining, generalized association rule mining and
multilevel association rule mining using partition approach.
Example 3.1 : Let us consider primitive rue
mining for the database in table 1.
TID |
Itemset |
100 |
I1, I2, I5 |
200 |
I2, I4 |
300 |
I2, I3 |
400 |
I1, I2, I4 |
500 |
I1, I3 |
600 |
I2, I3 |
700 |
I1, I3 |
800 |
I1, I2, I3, I5 |
900 |
I1, I2, I3 |
Table1 : Transaction
database
This table consists of nine transactions. Let us divide this database into three
partitions of equal size say three transactions each. First we consider first partition only. That is transactions 100, 200, 300 in first partition. We consider bit vectors for each these
transactions [7,8,9, 10]. Then we find
their associations by finding logical AND of these bit vectors namely BV1 ^
BV2, BV1 ^ BV3, BV1 ^ BV4, BV1 ^ BV5. Similarly we find logical ANDing
namely, BV2 ^ BV3, ……., BV4 ^ BV5. We
calculate support counts of these results.
This is continued for rest of the partitions. The bit vector for partitions is given in table below. Here in
the table 1, there are 9 transactions and 5 items. We shall consider the transactions into 3 partitions consisting
of 3 transactions each. We identify bit
vectors for the items in each partition as in table 2.
ID |
Partition 1 |
Partition 2 |
Partition 3 |
I1 |
1 0 0 |
1 1 0 |
1 1 1 |
I2 |
1 1 1 |
1 0 1 |
0 1 1 |
I3 |
0 1 1 |
0 1 1 |
1 1 1 |
I4 |
0 1 0 |
1 0 0 |
0 0 0 |
I5 |
1 0 0 |
0 0 0 |
0 1 0 |
Table 2: Partitioned
database
We perform
logical AND one partition at a time. I1
is AND’ed with I2, I3, I4, I5 for the first partition and the result that
satisfy minimum support is considered. Then I2 is AND’ed with I3, I4, I5 for
the first partition. This is continued for I3, I4 and I5 for the first
partition. Once first partition calculation is over, the memory buffer contains
large itemsets for first partition with corresponding support count. The above
procedure is repeated for all partitions.
And we get cumulative support and large itemsets. Then we construct the graph and traverse it
to find patterns and using these patterns to find rules.
Here, the
association patterns of order 3 are I1 I2 I3, I1 I2 I5 and the rules are I1 ̃ I2 I3, I1 ̃ I2 I5, I1
I2 ̃ I3, I1 I2 ̃ I5, etc. The corresponding graph in figure 1.
The
generalized association rule mining is done in a similar manner using small
partitions of the database into main memory one at a time. Here we maintain a separate list of parents
for the concept hierarchy tree. Using
this parent list the parents of each item are added into the transactions and
this updated transaction is mined for generalized association rules.
Figure
1 : Association graph
Example
3.2
Let us
consider the concept hierarchy for the database..
Here, the
items including parents are I1, I2, …., I13. We consider the extended
transactions after adding parents from the parents list. Here we represent
using binary vectors of each item.
After logical ANDing of I1^I2, I1^I3, ….., I12^13, we get the associated
graph.
While
constructing the graph there is no edge from a child to a parent and vice
versa. Then from the graph traversal
is done to construct the patterns (like (I6,I10,I11) & (I7,I10,I11) ) and
they are used for finding generalized association rules (like I6 ̃ I10,I11
and I7̃ I10,I11 etc.).
The third
algorithm multilevel association rule mining is considered for each level
separately. For example if the database
considers a three level items namely ‘category’, ‘content’ and ‘brand’
represents for example “Wonder wheat bread” where category is bread, content is
wheat and brand is Wonder. Here the
association rule is constructed level wise.
For example category wise irrespective of content and brand, or category
and content wise irrespective of brands or by considering all three levels
namely category, content and brand.
We can
either represent this database as level-3 (ex. 111) or level-2 (ex. 11*) or
level-1 (ex. 1**). Here after identifying the level, level wise database is
partitioned into small partitions, which are brought to memory one at a time
and the corresponding large itemsets with support counts, are identified. Similarly this is repeated for second
partition, etc. At the end of last partition the graph is constructed by taking
into account the cumulative itemsets with cumulative support counts.
The graphs
are traversed for finding the patterns (ex. 11* 12* 22*, 11* 21* 22* etc.).
Then rules are constructed (like 11* ̃ 12*, 22*
and 11* ̃ 21*,
22* etc.). Other obvious rules are 111 ̃ 211 and 1** ̃ 2**.
For
each algorithm, five different databases were considered with sizes of 10K,
20K, 30K, 40K and 50K transactions of synthetic data. First we compare graph mining method with our proposed
partitioned graph mining. The
performance shows encouraging results as in Figure 2 - 6.
Figure
2 : GM vs. PGM
Here
the x-axis shows size of data base in number of records and y-axis shows the
execution time in seconds.
Next,
the generalized graph mining is compared with partitioned generalized graph
mining, which is shown in Figure 3.
Figure 3: GGM Vs. PGGM
Multilevel association rule mining using graph based
approach and partitioned approach are compared level-by-level. In all these levels, the results are
encouraging. The level-1, level-2 and
level-3 graphs are shown in Figures 4-6.
Figure
4 : Level 1 Graph
Figure 5 : Level 2 Graph
Figure 6: Level 3 Graph
In
all the above performances show partitioned graph approach is much better than
graph based mining of association rules and hence it is efficient and scalable.
Today,
the major hospitals are corporate hospitals with computerized information in
place. The disease of the patients,
treatment history, medicines are all available in the database for each patient.
We can mine this database for possible association of diseases and symptoms
using association rule mining illustrated in this paper. Also the considering the medicines as the
items purchased, the shopping behaviors can be analyzed. Other data mining techniques like
clustering, classification; outlier analysis can also be applied suitably for
the medical database. Also the
bio-informatics research is fast growing with integration of data mining
techniques like in the fields of drug design etc.
Data Mining, recently, is becoming much more
important as the number of databases and database size keeps growing. Researchers in many different fields have
shown great interest in data mining.
Different data mining tasks are association rule mining, classification,
clustering, sequential pattern mining, OLAP and Web mining. The task of association rule mining is to
find certain association relationships among a set of objects in a database. The association relationships are described
in association rules. There are
different algorithms for association rule mining namely, apriori, fp-growth,
graph based association rule mining.
Here we have discussed the limitation of graph based association rule
mining namely memory usage. Our approach
divides the given database into number of partitions and working with the
partitions one at a time in the memory and thus avoids bringing the whole
database into memory. We have analyzed
and applied this approach into all the three kind of graph based association
rule mining namely primitive association rule mining (PARM), generalized
association rule mining (GARM) and multilevel association rule mining
(MARM). In all the cases, the
performance of partitioned based approach outperforms graph-based approach
mentioned in [2].
As all the databases are transaction databases, as a
future improvement, we have plans to analyze the raw transaction database and
convert into an optimized database using transaction database mining language,
which we are working concurrently. Also
we would like to apply these algorithms to suitable applications with real time
data.
Acknowledgements:
The authors express sincere thanks to the management
of PSG College of Technology, Coimbatore to carry out this research by providing
all resources. The first author
expresses his gratitude to his supervisor for his support in this work. Also he thanks the management of Kumaraguru
college of Technology for permitting him to do Ph.D. programme.
1.
R.Agrawal, R.Srikant, “Fast Algorithms for mining Association Rules”,
Proceedings of the 20th VLDB Conference, pp.487-499, Santiago,
Chile, 1994.
2.
Show-Jane Yen and Arbee L.P.Chen, “A Graph based Approach for Discovering
Various Types of Association Rules”, IEEE Transaction on Knowledge and Data
Engineering, Vol.13, No.5, pp. 839-845, Sep/Oct. 2001.
3. Jiawei Han, Micheline Kamber, “Data Mining Concepts and Techniques”,
Morgan Kauffman Publishers, San Francisco 2001.
4.
R.Srikant and R.Agrawal, “Mining Generalized Association rules”, In Proc. Of
the 21st Int. Conf. on Very Large Databases, pp.407-419, Zurich,
Switzerland, 1995.
5.
Willi Klosgen and Jan M. Zytkow, “Hand Book of Data Mining and Knowledge
Discovery”, Oxford University Press, 2002.
6.
Jiawei Han, Yongjian Fu, “Mining Multiple Level Association Rules in Large
Databases”, Proc. of 1995 Int'l Conf. on Very Large Data Bases (VLDB'95), pp.
420-431, Zurich, Switzerland, 1995.
7.
Muthukumar A, Nadarajan R, “An Efficient Approach to Mining Association Rules
in Transaction Databases”, Indian Journal of Information Science and
Technology, Irofis Publications, Vol. 1, No. 1, pp. 10-17, Nov. 2005.
8.
Muthukumar A, Nadarajan R, “ A Novel Approach for Mining Association Rules in
Dynamic Databases”, Proc. Of National Conference
on Advanced Computing (NCAC2006), MIT, Anna University, Chennai, pp.142-147, Feb. 2006.
9.
Jay Ayres, Johannes Gehrke, Tomi Yiu, Jason Flannick, “Sequential Pattern
Mining using a bit map representation”,
KDD 02, pp. 429-435, 2002.
10.
Mohammed J. Zaki, Ching-Jui Hsiaso, “CHARM : An Efficient Algorithm for Closed
Itemset Mining”, SDM 2002.